The GEA pipeline for characterizing Escherichia coli and Salmonella genomes

Salmonella enterica and Escherichia coli are major food-borne human pathogens, and their genomes are routinely sequenced for clinical surveillance. Computational pipelines designed for analyzing pathogen genomes should both utilize the most current information from annotation databases and increase the coverage of these databases over time. We report the development of the GEA pipeline to analyze large batches of E. coli and S. enterica genomes. The GEA pipeline takes paired Illumina raw read files as input, assembles them, and annotates the resulting assemblies. Alternatively, assemblies can be provided as input and directly annotated. The pipeline provides predictive genome annotations for E. coli and S. enterica with a focus on the Center for Genomic Epidemiology tools. Annotation results are provided as a tab-delimited text file. The GEA pipeline is designed for large-scale E. coli and S. enterica genome assembly and characterization using the Center for Genomic Epidemiology command-line tools and high-performance computing. Large-scale annotation is demonstrated by an analysis of more than 14,000 Salmonella genome assemblies. Testing the GEA pipeline on E. coli raw reads demonstrates reproducibility across multiple compute environments, and computational usage is optimized on high-performance computers.

Aaron M. Dickey 1*, John W. Schmidt 1, James L. Bono 1 & Manita Guragain 2*
Salmonella enterica (hereafter Salmonella) are estimated to cause at least 1 million illnesses in the United States each year 1. Escherichia coli are ubiquitous in a wide variety of environments relevant to food safety, including food animal gastrointestinal systems, animal production sites, human gastrointestinal systems, meats, and manure-impacted soils. A small but clinically important subset of E. coli are pathogenic. The ubiquitous nature of E. coli contributes to their relevance beyond food safety. Due to their prominence and small genome size, Salmonella and E. coli are also two of the top organisms with available whole genome sequencing read archives (https://www.ncbi.nlm.nih.gov/sra?term=(%22public%22%5BAccess%5D)%20AND%20%22genomic%22%5BSource%5D) and assemblies (https://www.ncbi.nlm.nih.gov/genome/browse#!/overview/). Such large datasets often rely on high-performance computing to accelerate computational tasks via increased RAM, threads, and parallelization 2. Large datasets can benefit from data analysis pipelines, which process many input files with an initial set of user specifications and distill the results to a small number of organized outputs for interpretation 3.
Useful pipelines for epidemiologic annotation should take advantage of the most up-to-date reference information available. Actively curated reference databases meet this need by rapidly incorporating newly released genomic data. The interplay between these two dependencies can be thought of as a positive feedback loop wherein (1) running the pipeline on new strains improves the database coverage and quality by exposing knowledge gaps and (2) the database improvement leads to more accurate search hits when running the pipeline on new strains.
Here, we introduce the GEA pipeline. GEA stands for Gammaproteobacteria Epidemiologic Annotation. Analyses central to the GEA pipeline are those using tools developed by the Center for Genomic Epidemiology (CGE) 4. The databases searched by these tools are updated frequently, facilitating the positive feedback loop between our pipeline and these databases. Several existing pipelines utilize CGE-developed tools 5–9. However, we are unaware of any other published pipeline that uses FimTyper 10, MLST 11, PlasmidFinder 12, ResFinder 13, SerotypeFinder 14, and VirulenceFinder 15 in tandem.
Another important feature of the pipeline is the use of a container. Containers allow for compute mobility 16 and provide an increased level of reproducibility 17. The container housing the software tools for running the GEA pipeline also takes advantage of the Scientific File System (SCIF) 18, providing independent mount points to different apps in the container with incompatible environmental requirements.
Prior versions of the GEA pipeline have been used in published research 19–21 in food safety, risk assessment, antimicrobial resistance gene transfer, and virulence research at the US Meat Animal Research Center. This demonstrates the utility of the pipeline and the benefit of the distilled annotation summary across hundreds of genomes. This paper describes the methods used for creating the pipeline and provides a kilo-scale demonstration of the pipeline on S. enterica assemblies. The pipeline is available from github.com/Phylloxera/GEA-dev.

Output
The principal output of the GEA pipeline is the tab-delimited file, metadata.txt. The metadata file can be opened in Excel. The number of rows corresponds to the number of assemblies annotated and the number of columns is dataset dependent. The annotation quality is expected to improve as the CGE database coverage improves over time. The metadata from the test dataset of 96 E. coli raw read libraries contained 444 columns and the metadata from the demonstration dataset of 14,310 Salmonella assemblies had 597 columns. Table 2 summarizes the components of the two metadata files. Tables 3 and 4 provide single annotation snapshots from VirulenceFinder 15 (E. coli) and ResFinder 13 (Salmonella), respectively, while Table 5 provides a snapshot of the all tools summary from E. coli. Table 5 suggests a relationship between FIM type and Antimicrobial Resistance Gene (ARG) content for the 96 E. coli libraries test dataset. The complete summary output file (GEA_ecoli_test_Ceres_metadata.txt) from USDA Ceres 22 is provided in Supplementary Data S1.
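As a minimal sketch of how such a tab-delimited table could be inspected outside Excel, the Python snippet below counts data rows (assemblies) and columns. It assumes the file has a single header row; the column names and values in the sample are hypothetical and are not taken from an actual metadata.txt.

```python
import csv
import io


def summarize_metadata(handle):
    """Return (number of data rows, number of columns) for a tab-delimited
    table, assuming the first row is a header (an assumption, not confirmed
    by the text above)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_rows = sum(1 for row in reader if row)  # skip empty trailing lines
    return n_rows, len(header)


# Hypothetical two-assembly excerpt; real column names may differ.
sample = (
    "Sample\tMLST\tFimType\n"
    "strainA\tST10\tfimH27\n"
    "strainB\tST131\tfimH30\n"
)
print(summarize_metadata(io.StringIO(sample)))
```

For a real run, the same function could be given `open("metadata.txt", newline="")` in place of the in-memory sample.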

Demonstration
The pipeline ran on the 14,310 S. enterica assemblies in ~72 h (~18 s per S. enterica assembly) on the USDA/Mississippi State University Atlas cluster 22. The complete summary output file is in GEA_senterica_demo_Atlas_metadata.txt in Supplementary Data S1.
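The per-assembly rate follows directly from the reported wall time and assembly count. The short Python check below reproduces that arithmetic; because the 72 h figure is itself approximate, the result should be read as a rough rate rather than a benchmark.

```python
def seconds_per_assembly(wall_hours, n_assemblies):
    """Average wall-clock seconds spent per assembly."""
    return wall_hours * 3600 / n_assemblies


# Reported values: ~72 h for 14,310 assemblies.
rate = seconds_per_assembly(72, 14310)
print(round(rate, 1))
```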

Discussion
A diverse set of bioinformatic tools has been developed for phenotypic prediction based on genomic data, especially for human pathogens. These tools often grow out of the requirements sought by a group of researchers. In the case of GEA, these included the desire to assemble and run CGE tools at the command line on large numbers of strains with a single summary output and to have genome assemblies in a single directory ready for submission to NCBI.
Users have reported different results with the same input data, sometimes with analyses conducted many months apart. This is suspected to be caused by updates in the actively curated CGE databases, and this has been confirmed in some instances. The user therefore has the option either to update their local copy of the CGE databases and use the most up-to-date versions, or to leave the databases static for reproducibility across independent computational runs. Reproducibility has long been an aspiration of scientific analysis; however, database-dependent analyses may demonstrably benefit from non-reproducibility as database coverage increases with the passage of time.
In our testing phase, we sought reproducibility across multiple computational environments, and this was largely achieved. The only difference in the outputs across high-performance computers was due to the contig names assigned by shovill (https://github.com/tseemann/shovill). In all these cases, the length, coverage, and Pilon 23 name were identical, whereas the shovill-assigned contig integers differed by 1. Furthermore, shovill contig names included the date assembled, an additional possible source of discrepancy for runs taking place on different days. For example, contig00205 len=509 cov=31.1 corr=0 origname=NODE_348_length_509_cov_31.119617_pilon sw=shovill-spades/1.1.0 date=20231107 on Ceres vs contig00204 len=509 cov=31.1 corr=0 origname=NODE_348_length_509_cov_31.119617_pilon sw=shovill-spades/1.1.0 date=20231101 on Moose (the discrepancies are the contig integer and the date; example from row 15, column 128 of GEA_ecoli_test_Ceres_metadata.txt in Supplementary Data S1). The identical coverage and Pilon designation indicate that, in all cases, these were identical contigs, but that the final integer contig name assignment in shovill may not be deterministic. Additionally, E. coli test data Assembly_bp and Ncontigs statistics were identical across computing environments. The other discrepancy was caused by insufficient memory being available to Skesa 24 on the Desktop computer, causing two libraries to not assemble and resulting in missing Skesa plasmid annotations (rows 33 and 35, columns 21 and 22 of GEA_ecoli_test_Ceres_metadata.txt in Supplementary Data S1). Importantly, these assembly failures were documented by the GEA pipeline log, which alerted the Desktop user. Apart from these two discrepancies, outputs were identical across computing environments.
GEA has clear advantages and limitations relative to tools with similar goals. Long-available tools, such as nullarbor (https://github.com/tseemann/nullarbor) and TORMES 7, have the distinction of an active user base, citations, and more time under development. Bacannot 25 is a newer tool, which is container based like GEA. Software containerization increases reproducibility over OS-specific source-compile-install-run and cross-platform package manager methodologies 17. RSYD-BASIC 26 is also a newer tool, which produces a tabular output somewhat like that produced by GEA. The lack of a web server option is a limitation of GEA. However, having a batch command-line implementation of CGE tools was a central functionality driving the development of GEA. GEA also lacks phylogenetic methods, except to the extent that typing predictions are phylogenetically informative. Currently, GEA is only indicated for E. coli and S. enterica. The strongest advantages of GEA are a single dependency (Apptainer 27), batch processing in an HPC environment proven by hundreds of successful analyses of Illumina raw read libraries 21, and a kilo-scale processing rate of ~18 s per S. enterica assembly, as demonstrated in this report.
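The shovill contig-name discrepancy discussed earlier in this section can be checked programmatically. The following Python sketch assumes headers are written in the compact form `name key=value ...`, ignores the leading contig name, and compares the remaining fields after dropping `date`. The function names are hypothetical and this is not part of GEA.

```python
def parse_contig_header(header):
    """Split a shovill-style header into (contig name, dict of key=value fields)."""
    name, *fields = header.split()
    return name, dict(field.split("=", 1) for field in fields)


def same_contig(header_a, header_b, ignore=("date",)):
    """True when two headers agree on every field except those in `ignore`.
    The contig name itself (e.g. contig00205) is deliberately not compared."""
    _, fields_a = parse_contig_header(header_a)
    _, fields_b = parse_contig_header(header_b)
    for key in ignore:
        fields_a.pop(key, None)
        fields_b.pop(key, None)
    return fields_a == fields_b


# Headers from the Ceres and Moose runs quoted in the text (compact form assumed).
ceres = ("contig00205 len=509 cov=31.1 corr=0 "
         "origname=NODE_348_length_509_cov_31.119617_pilon "
         "sw=shovill-spades/1.1.0 date=20231107")
moose = ("contig00204 len=509 cov=31.1 corr=0 "
         "origname=NODE_348_length_509_cov_31.119617_pilon "
         "sw=shovill-spades/1.1.0 date=20231101")
print(same_contig(ceres, moose))  # True
```

Comparing on `origname`, `len`, and `cov` rather than on the contig integer mirrors the reasoning used in the text to conclude that the two contigs are identical.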
Shovill and Skesa can both handle low levels of contamination. The pipeline has been tested with a diversity of libraries from three different Illumina sequencing platforms at the USMARC Core Lab, but exhaustive testing on the types and degrees of contamination has not been conducted. Regardless, preprocessing or quality control of libraries should be unnecessary.
The Gammaproteobacteria name derivation of the GEA pipeline implies a much broader set of pathogens than are currently included. One noticeable impact of incorporating new species is a substantially smaller set of annotations for the added species, at least initially. This is because all other species have fewer applicable CGE tools relative to E. coli. This is surmountable in instances where there is a non-CGE tool available for the task (e.g. GEA uses SeqSero2 28 for serotyping Salmonella since the CGE tool, SerotypeFinder 14, does not serotype Salmonella). Other enhancements that could be incorporated into GEA in the future include long-read sequence inputs, identification of additional genotypes/phenotypes as new CGE tools are released, and downsampling reads to accelerate Skesa assembly, as is done with shovill. GEA is now available for other users with institutional HPCs for rapid characterization of large batches of E. coli and Salmonella genomes from diverse sample sources. GEA is available for download from github.com/Phylloxera/GEA-dev.

Pipeline
The GEA pipeline implements the following steps in sequential order (Fig. 1). GEA processes the user command-line options and inputs. If the inputs are raw reads, the workflow proceeds to Assembly.
1. Assembly is first carried out by Shovill (https://github.com/tseemann/shovill) followed by SKESA 24. Skesa is used for identifying complete plasmids and can circularize some novel small plasmids not yet in the PlasmidFinder database.
2. Epidemiologic Prediction is conducted on shovill assemblies or user-supplied fasta contigs. Local copies of the CGE tool and 5 loci databases (https://github.com/Phylloxera/5loci) are updated by the pipeline unless otherwise specified by the user. The databases are stored locally in the user's home directory (e.g. $HOME/share/resfinder). BLAST 29 is used as the search method for the CGE tools and to query the custom 5 loci databases.
a. PlasmidFinder 12 (CGE)
b. MLST 11 (CGE) is run. If the user does not specify the taxon, mlst is run against both databases and the species is determined by GEA.
c. Serotyping is conducted using SerotypeFinder 14 (CGE, E. coli) or SeqSero2 28 (Salmonella).
d. ResFinder4 13 (CGE) is run against the resfinder_db and pointfinder_db databases.
e. The 5 loci databases are queried with BLAST.
f. VirulenceFinder 15 (CGE, E. coli)
g. Ezclermont 30 (E. coli)
h. FimTyper 10 (CGE, E. coli)
3. The results are compiled and written to the tab-delimited output file, metadata.txt. GC and N50 summary statistics are calculated by stats.sh (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/statistics-guide/) during results compilation and additional summary statistics are extracted from the shovill logs.
GEA is written in Bash. The current version has several new features relative to prior development iterations used in previous work.
1. The user can specify the taxon.
2. The user can specify whether to update their local copy of the databases.
3. The input data can be raw gzipped paired reads (fastq) or genome assemblies (fasta).
4. The following new tools have been added: Ezclermont, FimTyper, VirulenceFinder, and the 5 loci databases.
5. The container recipe utilizes SCIF 18 for software environment modularity inside the software container.
6. The container and pipeline are made available via https://github.com/Phylloxera/GEA-dev.

Testing
GEA was tested on 3 Linux high-performance computers and a desktop computer with Hyper-V enabled on Windows 10 Professional (Table 1). The data used in testing were a single plate of 96 Illumina E. coli raw read libraries from a long-term evolutionary study. Testing utilized the -u F option to query identical versions of the CGE databases and evaluate reproducibility across the compute environments.

Demonstration
To demonstrate the pipeline at kilo-scale, GEA was run on 14,310 Salmonella enterica genome assemblies released during October 2023. The assemblies were downloaded on November 6, 2023 using datasets (https://www.ncbi.nlm.nih.gov/datasets) with download genome options: taxon 28901, --include genome, --exclude-atypical, --released-after 10/1/2023, --released-before 10/31/2023, --assembly-source GenBank, and --dehydrated. Fasta files were moved to a single folder to be used as input. GEA was run on the Atlas high-performance computer system of Mississippi State University and the US Department of Agriculture 22 on November 11, 2023, with options: -t senterica, -u F, -r 336:00:00, -m 360G, and -c 48. Initial tests predicted a run time of 2-5 days.
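For the demonstration dataset, FASTA files were moved into a single folder before being passed to GEA. A minimal Python sketch of that preparation step is shown below. It assumes the downloaded assemblies sit in nested per-accession subdirectories and carry one of the listed FASTA suffixes; the function name, directory names, and suffix list are illustrative assumptions rather than the authors' procedure.

```python
import shutil
import tempfile
from pathlib import Path


def gather_fastas(source_root, dest_dir, suffixes=(".fna", ".fasta", ".fa")):
    """Move every FASTA file found under source_root into dest_dir.
    Returns the moved file names; refuses to overwrite an existing file."""
    source_root, dest_dir = Path(source_root), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing before any file is moved
    for path in sorted(source_root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            target = dest_dir / path.name
            if target.exists():
                raise FileExistsError(f"duplicate file name: {path.name}")
            shutil.move(str(path), str(target))
            moved.append(path.name)
    return moved


def demo():
    """Build a throwaway nested layout (hypothetical accession folders) and flatten it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "download"
        for accession, name in (("GCA_hypothetical_1", "a.fna"),
                                ("GCA_hypothetical_2", "b.fna")):
            folder = root / "data" / accession
            folder.mkdir(parents=True)
            (folder / name).write_text(">contig1\nACGT\n")
        return gather_fastas(root, Path(tmp) / "gea_input")


print(demo())
```

Raising on duplicate file names is a deliberate choice here: silently overwriting would drop an assembly from the batch without any record in the output table.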

Figure 1. Sequential workflow carried out by the GEA pipeline.

Table 1. Computational resources used for GEA Pipeline testing.

Table 2. Summary of the sections of the GEA Pipeline test data and demonstration data tabular output.

Table 3. A 7-column VirulenceFinder annotation from a 3-assembly subset of the 96 E. coli libraries test data (Supplementary Data S1).

Table 4. A 7-column ResFinder annotation from a 4-assembly subset of the 14,310 Salmonella assemblies demonstration data (Supplementary Data S1).

Table 5. A 3-column portion of the GEA Pipeline All Tools Summary for the 96 E. coli libraries test data (Supplementary Data S1) showing acquired antimicrobial resistance gene content by fim type. 79 libraries with FimH82 and 0 resistance genes are not shown.